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1. Introduction 

A number of problems in image processing, scientific computing, and numerical optimi- 
zation are well suited for solution using parallel or pipelined computations. Typically there are far 
fewer processors than elements to be processed, and so some natural aggregation of elements is 
required for partitioning the workload among the processors. For example, a spatial aggregation 
of elements might be appropriate for image processing, whereas an aggregation based on contig- 
uous columns might be appropriate for various matrix computations. The focus of the present 
paper is on this latter type of aggregation, in which a linear ordering of elements is imposed and an 
optimal partitioning of the elements is required that achieves a balanced workload among the 
processors. 

One specific motivating example arises in studying parallel implementations of the 

Cholesky decomposition of a banded positive definite matrix A, as proposed by O’Hallaron 

T 

(1988). The standard Cholesky decomposition A = LL can be computed in a serial manner using 
two alternating stages. The first stage carries out an appropriate normalization of the current 
column; this information is then passed on to other columns during the second stage. In 
O’Hallaron’s pipelined implementation, a set of contiguous columns (defined by the matrix 
bandwidth) is allocated to a smaller number of processors. Again, the Cholesky scheme consists 
of alternating stages of normalization and updating sets of columns. Because of the need for global 
synchronization at the end of each stage, the computational speed is limited by the processor 
having the maximum workload. The overall speed is then determined by the sum of the maximum 
workloads at each stage, and an allocation of columns to processors that minimizes this sum is 
desired. 

Another specific motivating example occurs in the parallel solution of one-dimensional fluid 
flow problems using irregular grid hierarchies (see Berger and Oliger 1984). Nonuniform 
gridding is used to concentrate computational effort in domain regions where the solution changes 
rapidly (as from a shock or turbulence). In this grid hierarchy various uniformly spaced fine grids 
which do not span the entire domain are superimposed onto a uniformly spaced coarse grid which 
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does. The “stages” of such problems correspond to the numerical integration of grids with a 
common spatial separation; inter-stage synchronization is used to enforce data dependencies 
between solutions at differing grid levels. The entire computation is parallelized by dividing the 
domain (or equivalently, the coarse grid points) among the given processors. The linear ordering 
of these grid points reflects the natural constraint that grid point updates depend on contiguous grid 
values. As in the Cholesky decomposition, the processor having the heaviest workload at a given 
stage limits the progress of the entire system. The sum of the maximum workloads at each stage 
again measures the computation’s execution speed. 

The following section provides a mathematical formulation of the general problem of 
minimizing an overall objective based on the sum of maximum workloads. This problem is termed 
the Multistage Linear Array Assignment (MLAA) problem. It is shown in Section 5 that the 
associated decision problem is NP-complete for a general number (r) of stages. Thus, we first 
concentrate on providing efficient polynomial algorithms for the MLAA when r = 2. Two separate 
approaches are developed (in Sections 3 and 4) for the two-stage problem. Section 6 discusses 
how one of these algorithms can be generalized in a natural way to r stages, and it also develops 
heuristic procedures for an arbitrary number of stages. Computational experience with the exact 
and heuristic algorithms is presented in Section 7. 

2. Formulation 

It is supposed that n elements or modules are given, each of which has two associated 
nonnegative processing times Xj, y } > 0 (i = 1, ..., n). There are also available p processors, 
p < n, each of which is capable of handling an arbitrary number of modules. We consider a 
partition of the set of modules { 1, 2, ..., n} into p intervals Ij, I 2 , ..., I p where each interval 
consists of consecutive modules {t, t+ 1, ..., t+k}, k > 0. A (quite reasonable) assumption made 
here is that the processing time for each interval is additive in the processing times of its component 
modules: 
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The objective here is to select intervals Ij, ..., I p for the p processors to minimize z(Ij, ..., I p ), 
where 

z(Ij, I p ) = [max x(I.)] + [max y(l)] . (1) 

The extension to r processing times (stages) is immediate. 

This problem can be formulated on an acyclic network G = (N,E), where the node set 
N = (0, 1 , . . n} and the edge set E consists of all edges (i, j) with i < j. Each edge (i, j) has the 
associated weights x(i, j) and y(i, j), defined by 

x (bj) = X x k ’ y(bj)=X y k' (2) 

i<k<j i<k<j 

Notice that G has m = 0(n 2 ) edges. 

It is easily seen that our two-stage problem is precisely that of finding in G a path of exactly 
p edges from node 0 to node n that minimizes the sum of the maximum edge weight in each 
component. This “min-sum-max” problem is a bicriterion path problem that has not been 
extensively studied, although related bicriterion path problems have been investigated. Warburton 
(1985) considers the “min-max-sum” problem and demonstrates that it is NP-hard for r = 2 
components. The problem is strongly NP-hard for general r. A number of authors (Hansen 1980, 
Henig 1985) have studied the problem of generating “efficient” (Pareto optimal) paths in bicriterion 
networks, using a variety of path length measures. The most pertinent measure of path length to 
our present study is the maximum edge weight for each criterion. A path is termed efficient when 
no other admissible path has a smaller value for one criterion without a larger value for the other. 
Polynomial-time algorithms for this “minmax-minmax” bicriterion path problem have been given 
by Hansen (1980), Berman et al. (1987), and Warburton (1987). Other related work is presented 
by Bokhari (1988), relative to the single-objective problem of minimizing the maximum of the 
maximum x-weight and the additive y-weight, in a doubly weighted network. While the problem 
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studied here is significantly different from these other problems, certain common solution 
strategies exist that have been exploited in our algorithmic development of the next sections. 

Before discussing algorithms for the two-stage case, it is worthwhile to point out certain 
difficulties intrinsic to the problem of minimizing the objective function (1). Consider the two- 
stage problem with processing times x ; , y ; given in Table 1, with p = 3. The (unique) optimal 
partition for this problem has I, = { 1 }, I 2 = (2), I 3 = {3, 4}. However, in the subproblem 
defined over the modules I 2 u I 3 = (2, 3, 4}, the optimal partition for p = 2 processors uses 
I 2 = {2, 3} and I 3 = {4}. Thus the principle of optimality required for a dynamic programming 
formulation does not hold. In other words, optimal solutions cannot be found simply by extending 
(in the best possible way) the known optimal solutions for smaller subproblems. 

Table 1 

A Three Processor Example that Violates the Optimality Principle 
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3. The Labeling Algorithm 

In this section a labeling algorithm is developed to solve the two-stage MLAA problem (1). 

This algorithm works directly on the network G introduced in the previous section and exploits the 

fact that G is acyclic. Since the principle of optimality cannot be invoked for this problem to 

extend optimal paths for a subproblem, we maintain at each node j e N several sets of candidate 

2 

paths. That is, associated with each node j will be certain sets L(j;k) of vectors in R . The set 
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L(j;k) = {(x 1 , y 1 ), (x 2 , y 2 ), (x l , y 1 )} corresponds to “path length vectors” for paths P from 
node 0 to node j having exactly k edges. In this context, the first component x s of such a path 
vector is simply the maximum x- weight of path P, and the second component y s is the maximum 
y-weight of path P. 

2 

To express these notions more precisely, let u, v e R be vectors with u = (uj, u 2 ) and 
v = (vj, v 2 ), and define the product u ® v by 

u ® v = (max(Uj, Vj), max(u 2 , v 2 )) . 

In this notation, the path length vector for path P is given by 

len(P) = ®]~J (x(i, j), y(i, j)) , 

(i.j)eP 

and the label L(j;k) on node j is given by 

L(j;k) = (len(P): P is a path from 0 to j with || p|| = k}. (3) 

Here x(i, j) and y(i, j) are as defined in (2), and || P || is used to denote the cardinality (number of 
edges) of path P. It will also be convenient to extend the definition of ® to sets of vectors: if 
S c R 2 and v € R ~ then 

S®v = {w®v:we S}. 

Throughout, we will maintain sets of path vectors for each node. At the end of the 
process, an optimal path vector to node n will be identified, from which it is straightforward to 
recover an associated path (via a simple backtracking scheme). Consequently, we shall work 
entirely with path vectors. Not all possible path vectors need to be retained at a node, however. 

To this end, a vector x is said to dominate vector y if x < y holds componentwise and at least one 
of the inequalities is strict. (This nomenclature is meant to convey the idea that x is “better” than y; 
however usage in this way is not standard in the literature.) A set S of vectors is efficient (Pareto 
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optimal) if no vector x e S dominates another vector ye S. More generally, the set 
EFFICIENT(S) is obtained from set S by removing all dominated vectors. 

In the algorithm given below, only efficient sets L(j;k) are maintained at each node. This 
algorithm represents a suitably modified version of the algorithm ACYCLIC given by Warburton 
(1987) for obtaining the set of efficient path vectors in a bicriterion shortest path problem. 

LABELING ALGORITHM 

1. L(0;0) = {(0, 0)}; 

L(0;k) = 0, for k = 1, ..., p. 

2. For j = 1, ..., n-1 

L(j;0) = 0; 

For k = 1, ..., p-1 

L(j;k) = EFFICIENT (W L(i;k-1)® (x(i, j), y(i, j))} . 

3. L(n;p) =EFFICIENT{kJ L(i; p-1 ) <S> (x(i, n), y(i, n))} . 

i<n 

4. Select v = (v lt v 2 ) e L(n;p) for which Vj + v 2 is minimum. 

Upon termination, the labeling algorithm produces a set of efficient path vectors L(n;p), 
from which an optimal p-edge path vector for (1) is readily obtained in Step 4. If we also maintain 
the vector labels L(n; k) at node n, sensitivity analysis information will be available on the effect of 
reducing the number of processors from the fixed value p. Of course, optimal partitions for the 
various subproblems using modules { 1 , . . ., j), j < n, are also readily derived from the vector 
labels on node j. It should be noted that the range for k in Step 2 can actually be replaced by 
k = max{ 1, p-n+j), ..., min{p- 1, j}. This type of pruning allows the elimination of paths 
whose cardinality is too small to lead to an optimal path at node n having p edges, and it also 
avoids the examination of lists known to be empty. 

The worst-case time complexity of the labeling algorithm is determined by the complexity 
of calculating EFFICIENT(S) in Steps 2 and 3. Suppose that L max is the maximum size of any set 


L(j; k) produced during the course of the algorithm. By keeping each L(j; k) as a list of vectors, 
ordered by their first component, repeated merge sorts suffice to maintain an ordered list without 
dominated elements. In the worst case there can be O(n) sublists to merge, each of size L max , 
resulting in 0(n logn L max ) work to calculate EFFICIENTS). Overall, the complexity of carrying 
out Steps 2 and 3 is then 0(p n 2 log n L max ). Since each element of the efficient set L(j; k) repre- 
sents a path vector len(P) and since the first component of len(P) can assume at most m = |E| 
distinct values, L max < m = 0(n 2 ) and the overall time complexity of the labeling algorithm can 
be bounded by 0(p n 4 logn). Space requirements are 0(p n L max ) < 0(p n 3 ) in the worst case. In 
practice, however, these estimates would appear to be overly pessimistic and the expected perfor- 
mance of the algorithm should be significantly better. The empirical results presented in Section 7 
confirm this belief. 

4. The Recursive Bisection Algorithm 

This section describes a different approach for solving the two-stage problem, based on 
generating a set of candidate path vectors guaranteed to contain all efficient path vectors from node 
0 to node n. The optimal solution vector can then be easily obtained from this set. The following 
simple observation serves to motivate this second approach. Namely, if P* is an optimal path for 
the two-stage problem, then the optimal objective function value is given by 

z* = [max x(a)] + [max y(b)] = x(a*) + y(b*) , 

ae P* be P* 

where a* and b* are edges of P*. Consider the subgraph of G 
G a = {e e E: x(e) < x(a)} 

consisting of all edges with x- weight at most x(a). Then it is not difficult to find a “best” path in 
G a with respect to the y-weights, a path P a in G a from 0 to n with cardinality p that minimizes the 
maximum y-weight along P a : 
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Such a minimax path can be found using a straightforward modification of the standard Dijkstra 
shortest path algorithm (see Kalaba 1964). Finally, the optimal two-stage solution is found by 
selecting a path P a that minimizes x(a) + y(b* a ). This basic approach would require m calls to the 
minimax (p-edge) path routine, each such invocation requiring 0(p m) work, so the overall time 
complexity is bounded by 0(p m 2 ) = 0(p n 4 ). 

An improved approach that treats the x-weights and y-weights more symmetrically, and 
that has a significantly better complexity bound, will now be developed. First, the x-weights on 
the m arcs of G are sorted into nondecreasing order: x(aj) < x^) < ... < x(a m ). Likewise, the y- 
weights are sorted as y(bj) < y(b 2 ) < . . . < y(b m ). Next, we define the subgraph G a b of G via 

G a> b = (e e E: x(e) < x(a), y(e) < y(b)}. 

The MLAA problem can then be solved by finding edges a, b e E so that [x(a), y(b)] is feasible 
(i.e., G^ b admits a path P from 0 to n with ||P|| = p) and x(a) + y(b) is minimum (over all feasible 

pairs). 

It will be useful to introduce a feasibility matrix Z, whose (i, j) entry z- indicates whether 
the pair [x(aj), y(bp] is feasible (F) or not (I). Because of the previous ordering of x-weights and 
y-weights, whenever z st = F then z-- = F for i > s and j > t. Also, if z st = I then z- = I for i < s and 
j < t. A typical pattern for the matrix Z is illustrated in Figure 1 . The efficient path vectors 
[x(a), y(b)] are indicated by circled entries in the matrix. 


Z = 


'I I I © F 
I I I F F 
I I © F F 
© F F F F 
F F F F F 
F F F F F 


F ‘ 

F 

F 

F 

F 

F 


Figure 1. Example feasibility matrix Z. 
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We would like to generate a candidate set of entries (at most m), guaranteed to contain the 
efficient (circled) entries in the feasibility matrix. Since it is computationally expensive to deter- 
mine the values of entries in this matrix, it is desirable to evaluate as few entries as possible. It is 
somewhat surprising that we can in fact determine the candidate set (having possibly m entries) by 
evaluating only at most O(m) of the entries z-. 

For notational convenience, it will be assumed that m = 2 k for some integer k > 0. Because 
of the ordering properties of Z, derived from the ordering of x-weights and y-weights, we can 
perform a binary search on the elements of the (median) row s = y of Z to find the smallest t so 
that z sl = F. This will be termed a binary search at level 0, and it requires 0(log m) evaluations of 
entries in Z. Entry (s, t) of the matrix then becomes a candidate solution. The ordering properties 
of Z ensure that the submatrix of Z with upper left comer at (s, t) contains only F’s, while the 
submatrix with lower right comer at (s, t-1) contains only I’s. Therefore, it is only necessary to 
explore two remaining submatrices for candidate solutions: the submatrix with lower left comer at 
(s-1, t), and the submatrix with upper right comer at (s+1, t-1). Each submatrix can in turn be 
explored by performing a binary search along a row in median position within the submatrix. If 
the first submatrix has Cj columns and the second has c 2 columns, then the number of evaluations 
needed to carry out these two binary searches is 0(log Cj)+ 0(log c 2 ). By the concavity of the log 
function, y log Cj + j log c 2 < log j (Cj + c 2 ) = log — , and so the number of evaluations 
required for these two level 1 searches is of the order 2 log -y • In a similar way, the 2 l binary 
searches at level I require at most 2 l log ^ evaluations, so the total number of evaluations needed 
is of the order 

log m + 2 (log -j-) +4 ( lo gj) + ••• + -J- (2) + (y) 1 ~ m X j • 

Since the infinite series is convergent, at most O(m) entries of Z will need to be examined. A 
formal statement of this algorithm (RECURSIVE BISECTION) is given below, applied to an 
m x m feasibility matrix Z. 
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RECURSIVE BISECTION ALGORITHM 


[Main Program | 

C = 0 ; 

BISECT (Z, m, m, C); 

Select (i, j)e C with minimum x(aj) + y(bj). 

[Recursive Subprogram] 

procedure BISECT (M, rows, cols, C) 

s = T rows/2l; 

Determine (by binary search) the smallest t 
so that m st = F;C = Cu{(s, t)}; 
if s > 1 then 

Let M, be the submatrix of M with lower left 
comer at (s- 1 , t); 

BISECT (Mj, s-1, cols-t+1, C); 
if s < rows and t > 1 then 

Let M 2 be the submatrix of M with upper right 
comer at (s+1, t-1); 

BISECT (M 2 , rows-s, t-1, C). 

As shown above, this algorithm requires O(m) entries of Z to be evaluated in carrying out 
the various binary searches on partial rows of the input matrix. Once the smallest feasible entry in 
the prescribed row is determined, the candidate set C is updated appropriately. We now discuss 
how any entry of the feasibility matrix can be efficiently computed, when required. 

Recall that the value of an entry in the feasibility matrix is determined by whether there is a 
path in G a b of cardinality p from 0 to n. In view of the definition (2) of edge weights and the 
nonnegativity of processing times, the following monotonicity property is immediate. 


Property 1 . If i < j < k then x(i, j), x(j, k) < x(i, k) and y(i, j), y(j, k) < y(i, k). 
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The next property follows directly from Property 1. 

Property 2 . Suppose that i < j < k and that (i, k) e G a b . Then (i, j) e G a b and (j, k) e G a b . 

Proof : Since (i, k) e G a b we have x(i, k) < x(a) and y(i, k) < y(b). From Property 1, x(i, j) < 
x(i, k) < x(a) and y(i, j) < y(i, k) < y(b) and so (i, j) e G a>b . In a similar way, (j, k) € G a b . 

We are now in a position to describe an algorithm for determining the feasibility of 
[x(a), y(b)]. The basic idea is to start from node 0 and to repeatedly follow the edge in G a b that 
reaches to the node of largest index, until we reach node n. If this constructed path has cardinality 
< p then it will turn out that [x(a), y(b)] is feasible, and conversely. This greedy procedure 
(REACH) for checking feasibility in G a b is stated below. 

procedure REACH(a,b) 

1 . s = 0; card = 0; 

2. While card <p 

Find the largest j so that (s, j) e G a b ; 

card = card + 1 ; 

if j = n then output true and stop 
else s = j; 

3 . Output false. 

The validity of REACH is now established: namely, it will be shown that the output true is 
produced precisely when there is a path of cardinality p in G a b from node 0 to node n. Suppose 
that REACH outputs true. Then we are assured of a path P of cardinality < p in G ab from 0 to n. 
Since p < n, Property 2 ensures that if such a path actually has cardinality < p, then there must be 
some path of cardinality exactly p. (Just successively replace edges (i, j) € P by edges (i, k) and 
(k, j) as needed.) On the other hand, suppose there is a path Q in G a b of cardinality p from 0 to n, 
where Q = [(i 0 , i,), (ij, i 2 ), ..., (i lf i )], with i 0 = 0 and i p = n. Let the path produced by 
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REACH be P = [(j 0 , j,), (J 1 J 2 )’ •••» j t )l. with j 0 = 0. Clearly, since i 0 = j 0 and (i Q , ij) e 
G ab procedure REACH will generate j, > i,. Since (i,, i 2 ) e G ab Property 2 guarantees that 
(j,, i 2 ) e G a b and thus the node j 2 selected by REACH must satisfy j 2 > i 2 . Continuing in this 
way, we can establish that j 3 > i 3 , . . ., j p > i p = n. Hence REACH will detect a path of cardinality 
at most p to node n and output true. 

The time complexity of REACH, and consequently that of the recursive bisection 
algorithm, is now easy to establish. First, we note that the graph G^ b need not be explicitly 
constructed in order to carry out Step 2 of REACH. Namely, by Property 1 the edges emanating 
from node s are automatically ordered by their x- weight and by their y-weight: 

x(s, s+1) < x(s, s+2) < ... < x(s, n); y(s, s+1) < y(s, s+2) < ... < y(s, n). 

Thus, a binary search can be used on the n-s pairs (s, s+1), ..., (s, n) to find the largest index j 
for which x(s, j) < x(a) and y(s, j) < y(b). The time complexity of one iteration of Step 2 is thus 
0(log n), and REACH therefore can be implemented to run in 0(p logn) time. As a result, the 
recursive bisection algorithm has 0(p m logn) = 0(p n 2 logn) worst-case time complexity. Also, 
the initial sorting of x-weights and y-weights can be done in 0(m logm) = 0(n 2 logn) time, so the 
overall time complexity remains at 0(p n 2 logn). The space requirements are dominated by the 
0(n 2 ) storage needed to keep G as well as the sorted x-weights and y-weights. 

In summary, the recursive bisection method presented in this section exhibits a better 
worst-case time complexity than the labeling algorithm of Section 3. Because the former algorithm 
makes repeated use of binary searches (whose worst-case and average-case behaviors are similar), 
it is anticipated that in practice the empirical complexity of the algorithm should closely reflect its 
worst-case bound. This is in contrast to the labeling algorithm, whose empirical performance will 
depend on the number of labels (L max ) actually encountered. 

A final observation concerns the additive assumption made for combining processing times 
into edge weights x(i, j) and y(i, j). The specific (additive) way of combining the x k ’s and the y k ’s 
in (2) is by no means essential. In fact, any monotone nondecreasing function of the processing 
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times can be used. Then Properties 1 and 2 will continue to hold and the resulting algorithm will 
still be valid. 

5. Complexity Results 

In Section 2 we formulated the MLAA problem as a min-sum-max optimization problem on 
a graph. For the case when exactly two stages are associated with each module, two polynomial- 
time algorithms for this problem were presented in Sections 3 and 4. In the present section we 
prove that for r stages the MLAA decision problem is NP-complete. To do this we show that any 
3 -Satisfiability decision problem, a well known NP-complete problem (Garey and Johnson 1979), 
can be reduced to an instance of the MLAA decision problem. Before proceeding with the proof 
we state the aforementioned problems formally. 

3-SAT 

Instance: A collection C = {Cj, c 2 , ..., c m } of clauses on a finite set U = {xj, x 2 , ..., x u } of 
0, 1 variables such that each clause Cj is a disjunction of exactly three literals (x 4 or Xj 
is a literal). 

Decision Problem: Is there a truth assignment for U that satisfies all the clauses in C? 

MLAA 

Instance: A collection of n modules and p processors. Associated with each module i is an r 
dimensional nonnegative real weight vector (wj(i), w 2 (i), ..., w f (i)). 

Decision Problem: Is there a partition of the n modules into p contiguous intervals Ij, . . ., I p 
so that Xv = i max w. (I.) - B, where B is a positive real number? (As in Section 2, 
w k (l) = Z ie i.w k (i).) 

Theorem 1 . The MLAA decision problem is NP-complete. 

Proof : It is easy to see that MLAA is a member of class NP since a nondeterministic turing 
machine need only guess a partition I j, . . ., I p and check in polynomial time whether or not 

£k=i max j w k ( V - B - 
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It remains to show that any instance of 3-SAT can be transformed into a MLAA problem. 
For the purposes of the proof, we define B = 30u, p = 3u + 4m, n = 4u + 6m, r = 2u + 1, and 
F = 24u. Here F represents a parameter whose use will become apparent as the proof proceeds. 
Throughout the proof we will use the term bottleneck weight to refer to max. w. (I.) relative to 
some entry k. The transformation consists of two parts. The first involves constructing four 
consecutive modulds for each Xj e U; the second appends a further 6m modules (described later) 
corresponding to the m clauses. Associate with each of the first 4u modules a weight vector of 
dimension r. The first 2u entries of each weight vector will be positionally associated with the 
elements of U as follows: (xj, Xj, x 2 , x 2 , ..., x u , x u ). Entry 2u + 1 is not associated with a literal 
and will contain some fractional part of F. This entry is used to force particular assignments of 
modules to processors. In particular, the values chosen for F, B, and p will force a bottleneck 
weight of F in entry 2u + 1 for any feasible partition of the modules. 

Each set of four weight vectors associated with a variable Xj are assigned values in a similar 
fashion. Module j = 1 (mod 4) corresponds to the start of a new set of four modules associated 
with a variable. Assume, without loss of generality, that j = 1 (mod 4) corresponds to the i* 
variable. Each entry of the weight vector for module j will be 0, except for entry 2u + 1 which has 
a value of F. The weight vector for module j + 1 has a 2 in entry 2i - 1, an F/2 in entry 2u + 1, and 
0’s elsewhere. The weight vector for module j + 2 has a 2 in entries 2i - 1 and 2i, an F/2 in entry 
2u + 1, and 0’s elsewhere. Lastly, the weight vector for module j + 3 has a 2 in entry 2i, an F/2 in 
entry 2u + 1, and 0’s elsewhere. 

The values given to F, B and p insure that modules j = 1 (mod 4) are assigned to a single 
processor and modules j + 1, j + 2, and j + 3 are assigned to exactly two processors. In particular 
if modules j + 1 and j + 2 are assigned to the same processor then bottleneck weights of 4 and 2 
result for the components associated with and Xj respectively. We interpret this as indicating that 
X| is true since 4 is greater than 2. The alternative assignment indicates Xj is true. 

The second part of the transformation concerns clauses. We construct a sequence of 6 
consecutive modules for each clause in an instance of 3-SAT. Again we associate a weight vector 
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of length r with each of the 6m modules. The construction of these weight vectors follows a 
similar pattern for each of the m six-module sets. Let vsl (mod 6) and j = 4u + v. The weight 
vector for module j has 0’s in entries 1 to 2u and an F in entry 2u + 1. The weight vector for 
module j + 1 has a 2 for the entry corresponding to the first literal in clause c = |"(v - l)/6l + 1, a 
2F/3 in entry 2u + 1 , and 0’s elsewhere. The weight vector for module j + 2 has a 1 for the entries 
corresponding to the first two literals of clause c, an F/3 in entry 2u + 1, and 0’s elsewhere. The 
weight vector for module j + 3 has a 1 for the entry corresponding to the second literal of clause 
c, an F/3 in entry 2u + 1, and 0’s elsewhere. The weight vector for module j + 4 has a 1 for the 
entries corresponding to the last two literals of clause c, an F/3 in entry 2u + 1, and 0’s elsewhere. 
Lastly, the weight vector for module j + 5 has a 2 for the entry corresponding to the third literal of 
clause c, a 2F/3 in entry 2u + 1, and 0’s elsewhere. 

In view of the values chosen for F, B, and p, each module j = 4u + v, with v s 1 (mod 6) 
must be assigned to a single processor and modules j + 1, ..., j + 5 must be partitioned among 
exactly three processors. Any feasible partition of modules j + 1, ..., j + 5 into three intervals 
must have a bottleneck weight of no more than F in entry 2u + 1. In addition, such a feasible 
partition will result in a bottleneck weight of 3 for at least one (possibly two) of the entries 
(literals). Any literal associated with a bottleneck weight of 3 is considered as one satisfying the 
clause. Note that if the literal was assigned the value true earlier in the variable to module assign- 
ment, that entry will already have a bottleneck weight of 4 which will mask the 3. However, if a 
literal selected by the clause was not assigned the value true earlier, that entry’s bottleneck weight 
of 2 will be increased by the value 3 created in the modules associated with the clauses. 

This transformation of an instance of 3-SAT into an instance of MLAA is best illustrated by 
an example. Suppose S = (Xj v x 2 v x 3 ) a (x t v x 2 v x 3 ) is the given 3-SAT problem, so that 
U = (xj, x 2 , x 3 ), u = 3, m = 2, n = 24, B = 90, F = 72, and p = 17. The module weight 
vectors corresponding to the three elements of U are given in Figure 2 and the module weight 
vectors corresponding to the clauses are given in Figure 3. One solution for S has Xj, x 2 , and x 3 
true. A module to processor assignment corresponding to this solution is given in Figure 4. The 
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bottleneck vector associated with the solution in Figure 4 is (4, 2, 4, 2, 2, 4, F). The component- 
wise sum of this vector is F 4 18 - B, as expected. 

Finally we must show that a given instance of 3-SAT is solved if and only if the corres- 
ponding instance of MLAA has a solution with a cost of no more than B = 30u. Suppose that a 
given instance of 3-SAT has a solution. Our prior comments guarantee that the instance of MLAA 
constructed from the given instance of 3-SAT has an assignment for which the sum of the Xj and Xj 
bottleneck weights is 6 for all i = 1, . . u and the bottleneck weight in the last entry is F. There- 
fore we have a solution with total weight 6u + F = B. No solution to the constructed instance of 
MLAA with cost less than B is possible since entry 2u + 1 will always contribute a bottleneck 
weight of F and the bottleneck weights corresponding to Xj and Xj contribute a sum no less than 6, 
for a total weight of at least B. It follows then that any solution to the instance of MLAA will 
imply a variable assignment that is a solution for the corresponding instance of 3-SAT. 
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Figure 2. Module weights corresponding to the variables of U. 
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Since we can solve the MLAA decision problem by finding a min-sum-max path with p 
edges in an acyclic graph, it follows that the min-sum-max path decision problem is NP-complete 
for acyclic graphs. We state this as a corollary. 

Corollary 1 . The Min-Sum-Max decision problem for acyclic graphs is NP-complete. 

The NP-completeness of the MLAA decision problem depends on treating r as a problem 
variable. For fixed r the natural extension of the labeling algorithm in Section 3 has polynomial- 
time complexity. To describe this extension, the only change needed from our previous descrip- 
tion is in the efficient set calculation. The following discussion centers on the labeling algorithm 
pseudocode and complexity analysis given in Section 3. 

Let W(i, j) = (Wj(i, j), w 2 (i, j), . . w r (i, j)) be the weight vector associated with the edge 
between nodes i and j. Each computation of the form L(i;k-1) ® W(i, j) performed in Steps 2 and 
3 requires 0(r L max ) time; constructing the list Lj = L(i;k-1) ® W(i, j) requires 0(n r L max ) 

time. Notice that Lj has length L = 0(n L max ). Kung et al. (1975) describe an 0(L log r ‘ 2 L ) time 
algorithm (for r > 2) that can be used to calculate EFFICEENT(Lj). Since L max has 0(m r_1 ) length, 
a fact easily shown by induction, this extended labeling algorithm has 0(p n 2r log r ' 2 (n 2M )) time 
complexity. For fixed r, MLAA can thus be solved in polynomial time, albeit a polynomial of high 
degree. The following section outlines certain refinements for improving the performance of this 
type of algorithm. 

6. Algorithms for r > 2 

The recursive bisection algorithm of Section 4 can also be generalized for r > 2 stages: 
REACH extends in a natural way, and the feasibility matrix becomes an r-dimensional array. 
However, the search process does not scale well. In the two-stage problem, knowledge that (x, y) 
is the least costly feasible solution in a row allows us to ignore at least one-half of the rectangle 
being searched and creates two smaller rectangles to search recursively. In an r-stage problem we 
may similarly find the least costly feasible solution within a “row” (vectors that differ only in a 
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given dimension). However, this identification allows us to eliminate as little as l/2 r_1 of the 
hyper-rectangle being searched and creates up to 2 r -2 hyper-rectangles that must be searched 
recursively. Consequently the remainder of this section focuses on extensions of the labeling 
algorithm for r > 2. Specifically, we describe computational refinements of the labeling algorithm, 
together with two heuristics suitable for problems that are too large for exact solution. 

We have tested two algorithms for determining EFFICIENT(Lj): the algorithm of Kung et 

al. (1975) and an algorithm that does domination checking on the fly. On all the problems tested, 

the former algorithm ran an order of magnitude slower. Therefore, the less sophisticated (but for 

our problems, apparently faster) second approach for calculating L(j;k) has been subsequently 

used. This procedure constructs, then accepts or rejects, each potential member of L(j; k). An 

accepted member may cause previously accepted members to be discarded, and it is entered only 

once into L(j; k). A potential member is immediately rejected if it is found to be dominated by an 

. 2 

existing member of L(j; k). This algorithm, listed below, requires 0(n r L^,.) time to construct 
L(j;k). 

CONSTRUCTION ALGORITHM 

1. L(j;k) = 0; 

2. For i = 1, ..., j-1 

3. For all U e L(i; k- 1) 

T = U ® W(i, j); 

For all Ve L(j;k) 
if T<V 

L(j;k) = L(j;k) - {V} 
else if T> V 
go to 3; 

L(j;k) = L(j; k) u {T}. 

Let T e L(j; k) and suppose there is a path vector X e L(j; h), h < k, such that X < T 
(componentwise vector comparison). If P and Q are subpaths corresponding to path vectors T and 
X respectively, then in any solution path (with p edges) that uses P, the subpath P can be replaced 
with subpath Q, yielding a new solution path (with < p edges) having the same or an improved 
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solution cost. It follows from Property 1 of Section 4 that such a path can be extended to a 

solution path with exactly p edges without degrading the solution cost. This observation leads to 

our first pruning rule - a potential member T of L(j;k) is rejected if T > X for some X e L(j;h) 

with h < k. Use of this rule now increases the worst-case complexity of the L(j;k) calculation to 
2 

0(p n r L max ). Nevertheless, in empirical testing the rule was found to eliminate a large number of 
paths, and as a result improved the overall execution time. 

Further pruning is possible if we know the sum-max cost c of a feasible solution. Let T be 
any potential member of L(j ; k), corresponding to subpath P. The component sum of T is a lower 
bound on the cost of any solution path using P. If the component sum of T exceeds c, we know 
that any solution involving P has cost exceeding c and must therefore be suboptimal. The vector T 
is then eliminated by this forward pruning rule. 

Backward pruning is also possible. For any fixed j, the vectors W(i, j) increase monoton- 
ically in each component as i decreases. Thus if the sum of W(i Q , j 0 )’s components exceeds c, 
then for all i < iQ we know that (i, j 0 ) cannot be an edge in an optimal path. For every j we define 
low(c, j) to be the smallest index i such that the component sum of W(i, j) is less than c. Then in 
Step 2 of the construction algorithm above the lower bound of the i loop can be replaced with 
low(c, j). Backward pruning also allows us to free space used to represent path vector lists, an 
important consideration in extending the size of problems that can be solved (see Section 7). 

As will be described in Section 7, the extended labeling algorithm (with the above pruning 
rules incorporated) has been successfully used on a number of nontrivial examples; in one case we 
solved a 1024 module, four-processor, four-stage problem, and in another case we solved a 128 
module, sixteen-processor, eight-stage problem. Some problem classes (as the fluid flow problem 
mentioned in Section 1) tend to have a substantial number of modules, thus motivating the use of 
heuristics for the MLAA problem. Heuristics are also useful in providing feasible solutions for the 
forward and backward pruning rules, and are valuable even for the two-stage problem when n is 
so large that the 0(p n 2 log n) complexity is daunting. 
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The heuristics to be developed here are built on the ability to solve the one-stage problem 
very efficiently, in 0(p n log n) time. This efficient technique, based on the methodology devel- 
oped in Nicol (1988), is now briefly described. First note that the minimum bottleneck value for 
the one-stage problem is identically one of 0(n 2 ) edge weights. A one-stage version of REACH 
establishes the feasibility of any candidate weight in 0(p log n) time. The set of all edge weights 
x(i, j) can be conceptually organized in an upper triangular matrix, with i and j being the row and 
column indices, respectively. By Property 1, the entries in any row increase monotonically in the 
columns, while column entries decrease monotonically in the rows. The recursive bisection 
algorithm of Section 4 is then easily modified to find the minimum feasible edge weight using O(n) 
REACH calls, showing that the one-stage MLAA problem can be solved in 0(p n log n) time. 

Our approach to developing heuristics proceeds by projecting an r-stage problem onto a 
one-stage problem, which is then optimally solved. An optimal partition for the projected problem 
yields an approximate solution for the r-stage problem. It will also be seen that the ability to solve 
one-stage problems efficiently aids us in evaluating the quality of various heuristic solutions. 

Two projections are quite natural. The first takes a module’s r-stage weight vector and 
transforms it into a single-stage weight by summing all the weights. The second method trans- 
forms a module’s weight vector by selecting the maximum component. The resulting methods are 
termed the sum-projection and the max-projection heuristics, respectively. The sum-projection 
heuristic is of particular interest here. As will be seen in Section 7, this heuristic produces better 
partitions in practice than the max-projection heuristic. Furthermore, it will now be shown how 
the quality of a partition it produces can be bounded, relative to the (unknown) optimal partition. 
This bound is particularly tight when the variance in module weights is small. 

For i = 1, . . ., p let Wj be the sum of execution times, at all stages, of all modules assigned 
to processor i under the partition produced by the sum-projection, with W max the largest among 
these. Similarly, let V ; be the sum of all workloads on processor i under the optimal partition (with 
respect to the r-stage problem), and let V max be the largest of these. By the optimality of the one- 
stage partition for the sum-projection, W max < V max . Now for j = 1, ..., r let bj be the bottleneck 
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weight at stage j for the sum- projection partition, and let sp(j) denote the corresponding stage j 
bottleneck processor. Then for some positive otj < l we have bj = otj W sp ^. The r-stage cost of 
the sum-projection partition is thus C sp = 2^ =1 Oj W sp( ^, yielding 
r 

C S p — ®j ^max • (4) 

j=l 

Let C opl be the cost of an optimal r-stage partition. Clearly V max < C t , because represents 
the work that must be done by one processor whereas C opt is the total system finishing time. Thus 
we have W max < V max < C opt , and consequently 


w /c < c ,/C . 

max sp opt sp 


Now we would like to develop bounds for the left hand side of this inequality. 

For every module i, let and m (l) denote the respective maximum and minimum stage 
weights, with p ; = m (l) /M (l) and p* = min{p;: i = 1, ..., n}. Now Oj is the fraction of processor 
sp(j)’s total workload resident at stage j, and otj is consequently maximized if all of the stage j 
module weights assigned to sp(j) are as large as possible, and if all of the other stage weights for 
sp(j) are as small as possible. Without loss of generality suppose that k modules are assigned to 
sp(j) and that sp(j) is processor one. We can then bound otj from above using 


a. < 

j 

I 



^ |"(r-l)m (i) + M (l) 

i L- -I 


( 6 ) 


If Sj, . . ., s k and tj, . . ., t k are nonnegative reals then a straightforward induction on k shows that 


k_ 

S. 

l 

i=i < max 

lc - l < i <k 

!•, 

i=l 
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Applying the above to inequality (6) produces 


a 

j 


< max 
1 < i <k 


M 


(i) 


l(r-l)m (i) + M ( ° 


max 


1 


1 


l < i <k (r-l)p. + II (r-l)p* + 1 


This allows us to bound C sp in (4) from above, giving 

C < r ^max 

sp " (r-l)p* + 1 ‘ 


(7) 


Another bound on C sp can be constructed in an analogous fashion. Let S® and denote 

— , r (j) 

(respectively) the maximum and minimum weights over all modules at stage j, with a = £j =1 s • 
Then we have ctj < Xj, with 


x. = 

j 


,U) 


o - s® + S® ' 


This yields the second bound 

C S p s tW„„ , (8) 

r 

where x = X , x. . 

j=i j 

Applying the above bounds (7), (8) on C sp to (5) produces max{ 1/x, l(r-l)p* + l]/r} < 
C 0 p t /C S p, which implies in particular that 

max { 1/x, p*, l/r}<C opt /C sp . (9) 

Relation (9) provides in turn an upper bound on how much the cost of the heuristic solution varies 
from the cost of the optimal solution, thus indicating the quality of the sum-projection heuristic. 

The bounds above are especially encouraging when module weight variation, either within 
a module or within a stage, is known to be low. In the extreme case when a module’s stage 
weights are all equal (so p* = 1), or when all module weights within any stage are equal (so 
x - 1), relation (9) shows that C sp < C opt and thus the heuristic will in fact produce an optimal 
partition. In Section 7 we will encounter a problem class for which the value of p* is no smaller 
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than 1/2. We can then be assured that the sum-projection solution for this problem class is never 
more than twice as costly as the optimal solution. On the other hand, the 1/t bound is useful if any 
module weights are zero (forcing p* = 0). One class of problems studied in Section 7 frequently 
has zero module weights. 

The a priori bound (9) on solution quality is less satisfying when 1/T, p*, and 1/r are all 
small. An alternative a posteriori approach is based on our ability to quickly construct a lower 
bound on C opt . Namely, any r-stage problem can be decomposed by stage into r separate one- 
stage problems. Let a- be the optimal bottleneck value of the one-stage problem constructed from 
stage j module weights, and let optj be the stage j bottleneck value under an optimal partition for the 
r-stage problem. Clearly we must have aj < optj so that Xj =1 ^ < C opt . Given the cost C h of the 
partition produced by some heuristic, the ratio C opl /C h is bounded from below by the ratio 
(s'. j aj) /C h . Each aj can be found in 0(p n log n) time, making this a feasible mechanism for 
bounding the quality of any partition produced by a heuristic. Our experience with both exact and 
heuristic solution methods for r stages is summarized in the following section, together with 
computational evidence concerning the bounds on solution quality. 

7. Computational Experience 

We have implemented the extended labeling algorithm, as well as the two projection 
heuristics, in the C language on a networked, diskless SUN/3 workstation having 4 Mb of 
memory. This section describes the computational results obtained with these algorithms on three 
problem types. The performance benefits gained by using one of these algorithms as a 
preprocessor for a multiprocessor fluid flow code are also discussed. 

In our study of the extended labeling algorithm, the number of modules (n) was chosen to 
be a power of 2. The upper limit on n varied by problem type, with the largest n attempted being 
1024. In addition, p was chosen to be a power of two (between 2 and 16), and r was a power of 
two (between 2 and 8). Three classes of test problems were studied. In the first class, all module 
weights were randomly chosen from the uniform probability distribution on [1, 10001]. In the 
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second class, weights for any given stage were created using a sine function over |0, 27t] with 
amplitude 50, frequency 2, and translated upward in the y direction by the constant 150; all module 
weights are consequently between 100 and 200. To generate the weight for module i, we evaluate 
the sine function at x = i-2jt/(n-l) and retain the integer portion of that evaluation. The “phase 
constant” for such a curve is the smallest x e [0, 2it] at which the derivative is equal to 1. The 
curves for different stages have different phase constants: the phase constant for stage 1 is 0, and 
for stage j > 1 it is n/2i'K Figure 5 illustrates this weight generation procedure. The third class 
contains problems modeled on the fluid flow example described in Section 1. For this class we 
used at most four stages; these problems frequently have module weights of zero in regions where 
finer grids are absent. 


-C 


n 




2 3 4 5 6 


x 


Figure 5. Module weights in a four-stage sine wave problem. 
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This computational study was devised to (i) determine how large a problem can be solved 
exactly, (ii) determine whether the refinements proposed improve our ability to solve larger prob- 
lems, and (iii) assess the quality of solutions produced by the heuristics, when compared to the 
exact solution obtained by the labeling algorithm. In this empirical study, problems from a given 
class with identical size parameters behaved similarly. Consequently, we feel justified in reporting 
the results from a single (representative) run. In the tabulations that follow, two problems differ- 
ing only in the number of processors had the same module weights. 

Table 2 presents a summary of results obtained using the extended labeling algorithm 
(utilizing all refinements described in Section 6). Results for each problem class, indexed by the 
number of modules (n) and the number of processors (p), display three performance measures. 

The first measure is elapsed processing time for the problem, in seconds. This time excludes that 
required to generate a feasible solution using the sum-projection heuristic, as this amount was 
insignificant in comparison to the execution time required by the extended labeling algorithm. If 
the running time has an associated asterisk, that problem has four stages; otherwise it has eight 
stages. The second measure is the speed-up factor from using the refinements: namely, the ratio of 
processing time without refinements (the basic algorithm) to that utilizing the refinements. The 
third measure is the ratio of the maximum number of path vectors in use at any time while running 
the basic version, divided by the same quantity for the version with refinements. The maximum 
number of path vectors used is an important measure, as it quantifies the amount of storage needed 
by an algorithm. Indeed, for the system configuration used in conducting the tests, the limiting 
factor in solving the largest problems turned out to be the available storage rather than the compu- 
tation time. The symbol (-) is used in Table 2 to indicate when the given problem could not be 
solved by the basic algorithm with the available resources. 

Table 2 shows that for a variety of problem types the extended labeling algorithm can solve 
problems of reasonable size in a reasonable amount of time. These results also show that the 
refinements proposed can significantly reduce the algorithm’s execution time and storage require- 
ments, and in doing so increase the size of problems that can be solved exactly. 
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We are also interested in the quality of solutions produced by the proposed heuristics, and 
in the behavior of the a priori and a posteriori bounds on C opt /C sp developed in Section 6. We 
have applied the sum-projection and max-projection heuristics to each problem listed in Table 2 and 
have computed the ratio of the optimal cost divided by the heuristic cost. We have also computed 
the a priori and a posteriori bounds for the sum-projection heuristic. Table 3 displays the sample 
mean and standard deviation of the individual problem measurements for each problem class. For 
the problems studied here, the sum-projection always achieved a better solution than the max- 
projection, and it typically obtained solutions quite close to optimal. The sum-projection’s a priori 
bounds tend to be low, due to significant variation in the module weights. On the other hand, the a 
posteriori bounds are reasonably tight, thereby justifying the additional effort needed to compute 
them. 

The sum-projection heuristic has been used to produce partitions for a multiprocessor 
implementation of the fluid flow problem. The multiprocessor employed is a Flex/32, at the 
NASA Langley Research Center. The Flex/32 is a shared-memory architecture, with eighteen 
processors available for parallel processing. The grid problems studied consisted of 2048 coarse 
grid points and four stages; each problem was partitioned for sixteen processors. The flow 
problems studied tended to concentrate fine grids at one end of the domain. We compared the 
execution time of the sum-projection method against that of the simplest partitioning method, 
which simply divides the domain into sixteen equi-length pieces. On a suite of five such problems, 
computations partitioned using the sum-projection heuristic ran more than twice as fast compared 
to the equi-length partition. Data taken from these runs is presented in Table 4. This provides 
clear evidence that substantial performance benefits can be gained by using the MLAA solution 
techniques on real problems, implemented on real parallel architectures. 


Table 2 

Performance of the Extended Labeling Algorithm on Three Problem Classes 
(Four-stage problems are marked with *, all others involve eight stages) 
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Uniform Random Problems 
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time 

time 
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time 

time 

space 

n 

(secs) 

ratio 


(secs) 

ratio 

ratio 

(secs) 

ratio 

ratio 

32 

3 

3 

3.5 

8 

6 

6.1 

156 

— 

— 

64 

14 

1 

3.6 

99 

6 

5.9 

286 

- 

- 

128 

10 

3 

3.4 

2147 

- 

- 

409* 

- 

- 

256 

24* 

2 

3.1 

1139* 

- 

- 





Sine Wave Problems 
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n 

(secs) 

ratio 

ratio 

(secs) 

ratio 

ratio 

(secs) 

ratio 

ratio 

32 

6 

1 

2.6 

6 

2 

4.5 

7 

2 

7.3 

64 

8 

2 

2.7 

16 

5.3 

8.2 

62 

13 

9.2 

128 

15 

3 

2.9 

406 

- 

- 

2960 

- 

- 

256 

190* 

- 

- 
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Fluid Flow Grid Problems 
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space 

time 
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n 

(secs) 

ratio 

ratio 

(secs) 

ratio 

ratio 

(secs) 

ratio 

ratio 

32 

1 * 

3 

1.8 

1* 

3 

3.0 

1* 

4 

3.6 

64 

4* 

1 


4* 

2 

4.0 

4 * 

4 

6.0 

128 

5* 

4 


17* 

3 

5.4 

33* 

6 

5.3 

256 

21 * 

3 

2.8 

191* 

- 

- 

232* 

- 

- 

512 

73* 

35 

3.1 

5534* 

- 

- 




1024 

4524* 

— 









Table 3 

Measurements and Lower Bounds on Ratio of Optimal Cost Divided by Heuristic Cost 



Uniform 

Sine 

Grid 

ratio 

mean 

deviation 

mean 

deviation 

mean 

deviation 

max 

0.92 

0.04 

0.94 

0.20 

0.79 

0.11 

sum 

0.97 

0.02 

0.98 

0.01 

0.85 

0.08 

a priori 

0.30 

0.07 

0.68 

0.05 

0.26 

0.02 

a posteriori 

0.87 

0.05 

0.90 

0.03 

0.70 

0.15 









Table 4 

Fluid Flow Problem Execution Times (in seconds) 
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run 

serial 

time 

MLAA 

time 

MLAA 

speedup 

standard 

time 

standard 

speedup 

i 

149.9 

14.9 

10.0 

32.8 

4.5 

2 

142.1 

14.3 

9.9 

30.8 

4.6 

3 

130.2 

13.4 

9.7 

30.0 

4.3 

4 

110.2 

12.5 

8.8 

28.4 

3.8 

5 

101.6 

11.7 

8.7 

22.9 

4.4 


8. Summary 

This paper has considered the problem of optimally assigning modules to processors, so 
that the sum of maximum workloads at each stage is minimized. After formulation of this problem 
as a multi-objective network optimization problem, two polynomially-bounded algorithms were 
developed for the case of r = 2 stages. The general problem (for arbitrary r) was demonstrated to 
be NP-hard. In order to solve large-scale problems, two heuristic procedures were developed 
based on projecting an r-stage problem onto an efficiently solvable one-stage problem. Bounds on 
the solution quality produced by the heuristic methods were obtained using a priori and a posteriori 
information. Computational experience was presented with one of the exact algorithms, incorpor- 
ating certain pruning rules and making use of a feasible solution produced by the sum-projection 
heuristic. The empirical results demonstrate that the exact method, suitably refined, can solve 
some reasonably large problems. For larger problems the sum-projection heuristic procedure 
shows a good deal of promise, exemplified by its success in partitioning processors for an actual 
fluid flow problem. 
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